Derive the unit lexer from registered symbols#29
Merged
Conversation
The tokenizer used a frozen character class (/[a-zA-Z_°'"][\w°'"]*/) to recognize unit symbols. Any registered symbol built from characters outside that class was silently dropped to a dimensionless value rather than rejected -- so Unit(1, "Ω") and Unit(1, "%") returned a unitless 1, and compatible_with? on them raised. The same latent bug affected ℃, π, the arcminute/arcsecond quotes, a.u., Å, and ℉. Build the tokenizer and symbol matcher from the symbols actually registered in the system, plus an ASCII baseline, and memoize them. #load clears the memo, so a unit defined at runtime -- including one whose symbol uses a non-ASCII glyph -- becomes lexable without touching the lexer. A symbol the system has never registered still parses to a bare unit token, preserving prior behavior for unknown ASCII symbols. Add a spec that round-trips every symbol in the default system, which is what surfaced the dropped glyphs above.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The tokenizer used a frozen character class (
/[a-zA-Z_°'"][\w°'"]*/) to recognize unit symbols. Any registered symbol built from characters outside that class was silently dropped to a dimensionless value rather than rejected -- soUnit(1, "Ω")andUnit(1, "%")returned a unitless 1, andcompatible_with?on them raised. The same latent bug affected℃,π, the arcminute/arcsecond quotes,a.u.,Å, and℉.Build the tokenizer and symbol matcher from the symbols actually registered in the system, plus an ASCII baseline, and memoize them.
#loadclears the memo, so a unit defined at runtime -- including one whose symbol uses a non-ASCII glyph -- becomes lexable without touching the lexer. A symbol the system has never registered still parses to a bare unit token, preserving prior behavior for unknown ASCII symbols.Add a spec that round-trips every symbol in the default system, which is what surfaced the dropped glyphs above.